DPOpenHermes 7B v2 is the second RL fine-tune of OpenHermes-2.5-Mistral-7B, trained with Direct Preference Optimization (DPO) on the Intel/orca_dpo_pairs and allenai/ultrafeedback_binarized_cleaned preference datasets.
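The card does not publish the training script, but a minimal sketch of this kind of DPO run, assuming TRL's `DPOTrainer` (API as of TRL ~0.7), might look like the following. The `beta` value, batch size, learning rate, output path, and column mapping are illustrative assumptions, not the authors' actual configuration:

```python
# Hedged sketch of DPO fine-tuning with TRL; all hyperparameters are illustrative.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "teknium/OpenHermes-2.5-Mistral-7B"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token

# Intel/orca_dpo_pairs stores each example as question/chosen/rejected;
# DPOTrainer expects prompt/chosen/rejected string columns.
# allenai/ultrafeedback_binarized_cleaned would be mapped and concatenated
# the same way before training.
orca = load_dataset("Intel/orca_dpo_pairs", split="train")
train_dataset = orca.map(
    lambda ex: {
        "prompt": ex["question"],
        "chosen": ex["chosen"],
        "rejected": ex["rejected"],
    },
    remove_columns=orca.column_names,
)

training_args = TrainingArguments(
    output_dir="dpopenhermes-7b-v2",  # hypothetical output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model,
    ref_model=None,       # TRL clones the base model as the frozen reference
    args=training_args,
    beta=0.1,             # KL-regularization strength; illustrative value
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```

Note that in later TRL releases `beta` moved into a `DPOConfig` object, so the exact call signature depends on the installed version.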